library(tidyr)
library(dplyr)
library(readr)
relig_income
这个数据从本质上来说包括三个变量: 1. religion, 分布在行上 2. income,分布在列名上 3. count,分布在单元格中
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")
主力函数pivot_longer参数: 1. 数据本身 2. 指定哪些列需要变形(reshape), 本例中是所有列,除了第一列。 3. names_to 指定变形之后的列名 4. values_to 指定单元格数据的列名
注意 names_to和values_to都是原数据relig_income中没有的列名。
billboard数据搜集了2000年歌曲排行版的数据。与上一个数据的情况类似, 但是存在列名上的不再是字符串,而应该是数值(星期几)
billboard
wkn我们希望是一个变量week,其单元格对应的取值是排行榜上的位置,所以应该是一个新的变量rank。另外,这个数据有很多缺失值,不是所有的歌曲都在排行榜上呆满了76周。
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
week应该是一个数值变量
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
names_ptypes = list(week = integer()),
values_to = "rank",
values_drop_na = TRUE,
)
who
从 new_sp_m014 到 newrel_f65,每一列其实都包含四个变量信息: 1. new_或者new:是否为新观测值,可能是一个哑变量,但本数据中全部都是新观测,因此该变量可以忽略。 2. sp/rel/sp/ep: 3. m/f:性别 4. 014/1524/2535/3544/4554/65: 年龄区间
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)
进一步,明确gender和age的因子格式
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_ptypes = list(
gender = factor(levels = c("f", "m")),
age = factor(
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family
family %>%
pivot_longer(
-family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
注意.value的使用,表示该变量本身就是观测值。
再看一个例子
anscombe
ans
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
) %>%
arrange(set)
另外一个例子:面板数据雏形
pnl <- tibble(
x = 1:4,
a = c(1, 1,0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
pnl
pnl %>%
pivot_longer(
-c(x, a, b),
names_to = c(".value", "time"),
names_pattern = "(.)(.)"
)
df <- tibble(x = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal")
df
df %>% pivot_longer(-x, names_to = "name", values_to = "value")
## Warning: Duplicate column names detected, adding .copy variable
相对来说,longer数据比wider数据更干净(tidy), 符合tidyr和类似plm包的数据格式要求。但有些时候,宽数据更适合做数据展示。其他软件也可能要求数据输入格式为wider。
fish_encounters
很多分析方法要求将station放在列上(变宽):
fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
很多缺失值NA,为什么?其实不是确实,而是有真实含义。因此:
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
)
warpbreaks <- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks)
warpbreaks
warpbreaks %>% count(wool, tension)
以下为什么报警?
warpbreaks %>% pivot_wider(names_from = wool, values_from = breaks)
## Warning: Values in `breaks` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list(breaks = list)` to suppress this warning.
## * Use `values_fn = list(breaks = length)` to identify where the duplicates arise
## * Use `values_fn = list(breaks = summary_fun)` to summarise duplicates
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = list(breaks = mean)
)
production <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
dplyr::filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
production
production %>% pivot_wider(
names_from = c(product, country),
values_from = production
)
us_rent_income 包括2017年美国收入和租金支出数据。
us_rent_income
这个数据有什么不干净的地方?
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
contacts
没有id标识
contacts <- contacts %>%
mutate(
person_id = cumsum(field == "name")
)
contacts
contacts %>%
pivot_wider(names_from = field, values_from = value)
有些问题不能通过一步拉长(宽)完成,需要组合两种pivot操作: world_bank_pop数据包含2000年2018年世界银行关于各国人口的数据
world_bank_pop
这个数据很不干净,我们从最明显的问题开始处理: - 第一个问题:年份跨列,这应该是个longer操作
pop2 <- world_bank_pop %>%
pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
pop2
pop2 %>% count(indicator)
“SP.POP.GROW is population growth, SP.POP.TOTL is total population, and SP.URB.* are the same but only for urban areas.”
应该有两个变量:area,表示total和urban; 还有两个实际变量:population和growth
首先需要用到separate
pop3 <- pop2 %>%
separate(indicator, c(NA, "area", "variable"))
pop3
pop3 %>%
pivot_wider(names_from = variable, values_from = value)
multi <- tribble(
~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
)
multi
先longer
multi2 <- multi %>%
pivot_longer(-id, values_drop_na = TRUE) %>%
mutate(checked = TRUE)
multi2
再wider
multi2 %>%
pivot_wider(
id_cols = id,
names_from = value,
values_from = checked,
values_fill = list(checked = FALSE)
)